Abstract — Cluster analysis is one of the main analytical
methods in data mining, and the choice of clustering
algorithm directly influences the clustering results. This
paper discusses the standard k-means clustering algorithm
and analyzes its shortcomings. It also focuses on web usage
mining, analyzing log data for pattern recognition; patterns
are identified with the help of the k-means algorithm.

Index terms — Pattern recognition, web mining, k-means 
clustering, nearest neighbour, pattern recovery. 

I. Introduction 

Clustering problems arise in many different applications,
such as data mining, web mining, data compression, pattern
recognition, and pattern classification. The notion of what
constitutes a good cluster depends on the application, and
there are many methods for finding clusters subject to various
criteria.

Among clustering formulations that are based on minimizing
a formal objective function, perhaps the most widely used
and studied is k-means clustering. k-means clustering is a
method of cluster analysis which aims to partition n
observations into k clusters in which each observation
belongs to the cluster with the nearest mean.

Given a set of observations (x_1, x_2, ..., x_n), where each
observation is a d-dimensional real vector, k-means clustering
aims to partition the n observations into k sets (k ≤ n),
S = {S_1, S_2, ..., S_k}, so as to minimize the within-cluster
sum of squares (WCSS); see Eq. (1):

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2 \qquad (1)$$

where \mu_i is the mean of the points in S_i.
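
As a minimal sketch of the objective in Eq. (1), the following Python function computes the within-cluster sum of squares for a given partition. The function name `wcss` and the two toy partitions are illustrative assumptions, not taken from the paper.

```python
def wcss(clusters):
    """Within-cluster sum of squares for a list of clusters.

    Each cluster is a non-empty list of points; each point is a tuple of floats.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Mean of the cluster, computed coordinate by coordinate.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # Add the squared Euclidean distance of every point to that mean.
        for p in points:
            total += sum((p[d] - mean[d]) ** 2 for d in range(dim))
    return total


# Hypothetical toy data: the same four points, partitioned two different ways.
good = [[(1.0, 1.0), (1.0, 3.0)], [(8.0, 8.0), (10.0, 8.0)]]
bad = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 3.0), (10.0, 8.0)]]

print(wcss(good))  # 4.0
print(wcss(bad))   # 102.0
```

The partition that groups nearby points together has the much smaller WCSS, which is the quantity k-means tries to minimize.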

II. Standard Algorithm 

The most common algorithm uses an iterative refinement
technique. Given an initial set of k means m_1, m_2, ..., m_k, the
algorithm proceeds by alternating between two steps:
Assignment Step 

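Assign each observation to the cluster whose mean is
closest to it (Eq. (2)), where t denotes the current iteration:

$$S_i^{(t)} = \{ x_p : \lVert x_p - m_i^{(t)} \rVert \le \lVert x_p - m_j^{(t)} \rVert \ \text{for all } 1 \le j \le k \} \qquad (2)$$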
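
The assignment step can be sketched in a few lines of Python. This is an illustrative helper (the name `assign` and the sample points are assumptions); it compares squared Euclidean distances, which gives the same nearest mean as comparing the distances themselves.

```python
def assign(points, means):
    """Return, for each point, the index of its nearest mean."""
    labels = []
    for p in points:
        # Squared Euclidean distance from p to every current mean.
        dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
        # Ties are broken in favour of the lowest cluster index.
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical data: two points near the origin, two near (10, 10).
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (8.0, 10.0)]
means = [(0.0, 0.0), (10.0, 10.0)]
print(assign(points, means))  # [0, 0, 1, 1]
```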

© 2013 ACEEE
DOI: 03.LSCS.2013.3.578



Update Step 

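Calculate the new means to be the centroids of the
observations in the new clusters (Eq. (3)), where S_i^{(t)} is the
set of observations assigned to cluster i at iteration t:

$$m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j \qquad (3)$$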
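
A minimal Python sketch of the update step follows. The helper name `update` is an assumption, and returning `None` for a cluster that received no points is a choice made for this sketch; the paper does not say how empty clusters are handled.

```python
def update(points, labels, k):
    """Return the centroid of each of the k clusters (None if a cluster is empty)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, lab in zip(points, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += p[d]
    # Divide each coordinate sum by the cluster size to get the mean.
    return [tuple(s / c for s in row) if c else None for row, c in zip(sums, counts)]


# Hypothetical data and labels, e.g. as produced by an assignment step.
points = [(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (8.0, 10.0)]
labels = [0, 0, 1, 1]
print(update(points, labels, 2))  # [(1.5, 1.0), (8.5, 9.5)]
```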

The algorithm has converged when the assignments no 
longer change. Commonly used initialization methods are 
Forgy and Random Partition. The Forgy method randomly 
chooses k observations from the data set and uses these as 
the initial means. The Random Partition method first randomly 
assigns a cluster to each observation and then proceeds to 
the update step, thus computing the initial mean to be the 
centroid of the cluster's randomly assigned points. The Forgy 
method tends to spread the initial means out, while Random 
Partition places all of them close to the center of the data set. 
According to Hamerly, the Random Partition method is
generally preferable for algorithms such as k-harmonic
means and fuzzy k-means.

For expectation maximization and the standard k-means
algorithm, the Forgy method of initialization is preferable.
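
The two initialization methods described above can be sketched as follows. The function names, the grid of sample points, and the retry loop that avoids empty clusters in Random Partition are assumptions made for illustration.

```python
import random


def forgy_init(points, k, rng):
    """Forgy: choose k distinct observations and use them as the initial means."""
    return rng.sample(points, k)


def random_partition_init(points, k, rng):
    """Random Partition: assign a random cluster to each point, then take centroids."""
    dim = len(points[0])
    while True:
        labels = [rng.randrange(k) for _ in points]
        if len(set(labels)) == k:  # retry if some cluster received no points
            break
    means = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        means.append(tuple(sum(p[d] for p in members) / len(members) for d in range(dim)))
    return means


# Hypothetical data: a 10 x 10 grid of points.
rng = random.Random(0)
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
print(forgy_init(points, 3, rng))             # three actual data points
print(random_partition_init(points, 3, rng))  # three centroids of random subsets
```

Because each Random Partition mean averages a random subset of the whole data set, these means tend to lie near the overall center, whereas Forgy means can fall anywhere in the data.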

III. Demonstration Of Standard Algorithm

The following steps demonstrate the k-means algorithm:

1) k initial "means" are randomly generated within the data
domain.

2) k clusters are created by associating every observation
with the nearest mean.

3) The centroid of each of the k clusters becomes the new
mean.

4) Steps 2 and 3 are repeated until convergence has been
reached.
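
The four steps above can be put together in a short Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the initial means are drawn from the data points (Forgy style) instead of being generated anywhere in the data domain, an empty cluster keeps its previous mean, and the function name, `seed`, `max_iter`, and the sample data are hypothetical.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Standard k-means on a list of equal-length tuples; returns (means, labels)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # step 1: initial means (drawn from the data here)
    labels = None
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, means[c])))
            for p in points
        ]
        if new_labels == labels:  # step 4: stop when assignments no longer change
            break
        labels = new_labels
        # Step 3: the centroid of each cluster becomes the new mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old mean
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return means, labels


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, labels = kmeans(data, k=2)
print(means)
print(labels)
```

On this data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random initialization.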

Fig. 1 to Fig. 4 demonstrate the above steps.

The input for the algorithm is the data from the log file,
shown in Table I. We take the patterns as the x-axis, labelled
P1, P2, P3, etc. for pattern 1, pattern 2, pattern 3, etc.
respectively, and then proceed with the algorithm. The output
of the algorithm consists of clusters of patterns, with the
number of clusters specified by the user.

This input is obtained by performing preprocessing steps
on the log file shown in Fig. 5.

Fig. 1. Step 1 (k=2)
Fig. 2. Step 2
Fig. 3. Step 3
Fig. 4. Step 4

IV. Advantages And Disadvantages

Strengths of the k-means algorithm:

The method is relatively scalable.

It is efficient in processing large data sets.

The complexity of the algorithm is O(nkt), where n is the total
number of objects, k is the number of clusters, and t is the
number of iterations.

Weaknesses of the k-means algorithm:

The number of clusters, k, must be specified in advance.

It is unable to handle noisy data and outliers.
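
The sensitivity to outliers comes from the use of the mean as the cluster center. As a small illustration, the sketch below uses the five time values from Table I and treats P1's 120 minutes as an outlying value; that reading is an assumption made only for this example.

```python
from statistics import mean, median

times = [120, 50, 40, 12, 37]  # time in minutes for P1..P5 (Table I)
without_p1 = [t for t in times if t != 120]

print(mean(times))       # 51.8
print(mean(without_p1))  # 34.75
print(median(times))     # 40
```

A single large value moves the mean from 34.75 to 51.8, above four of the five observations, while the median stays at 40.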
Table I. Input for algorithm

Pattern    Time in minutes
P1         120
P2         50
P3         40
P4         12
P5         37


login ab@gmail.com 04.02.13 04:08:31
login ab@gmail.com 04.02.13 04:21:30
view ab@gmail.com 04.02.13 04:22:07 upload/pdf/s.pdf
viewout ab@gmail.com 04.02.13 04:22:17
logout ab@gmail.com 04.02.13 04:33:12
login adi4u_2010@yahoo.in 04.03.13 16:14:05
view adi4u_2010@yahoo.in 04.03.13

Fig. 5. Log data
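
The paper does not describe its preprocessing steps in detail. As one plausible sketch, the Python below parses log records of the form shown in Fig. 5 and measures the time between a login and the following logout. The field layout (action, user, date, time, optional resource), the day.month.year date order, and the function names are all assumptions.

```python
from datetime import datetime

LOG = """\
login ab@gmail.com 04.02.13 04:21:30
view ab@gmail.com 04.02.13 04:22:07 upload/pdf/s.pdf
viewout ab@gmail.com 04.02.13 04:22:17
logout ab@gmail.com 04.02.13 04:33:12
"""


def parse_line(line):
    """Split one log line into (action, user, timestamp, resource or None)."""
    parts = line.split()
    action, user, date, time = parts[:4]
    # Assumption: dates are day.month.year; the source does not state the order.
    stamp = datetime.strptime(date + " " + time, "%d.%m.%y %H:%M:%S")
    resource = parts[4] if len(parts) > 4 else None
    return action, user, stamp, resource


def session_minutes(lines):
    """Minutes between each user's most recent login and the following logout."""
    started, durations = {}, []
    for line in lines:
        action, user, stamp, _ = parse_line(line)
        if action == "login":
            started[user] = stamp
        elif action == "logout" and user in started:
            delta = stamp - started.pop(user)
            durations.append((user, delta.total_seconds() / 60))
    return durations


print(session_minutes(LOG.splitlines()))  # one session of about 11.7 minutes
```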

Conclusions 

An efficient form of the k-means algorithm is presented. The
algorithm is easy to implement and relatively scalable, as
its complexity is O(nkt). The main drawback of the algorithm
is that the number of clusters has to be specified in advance.
The output of the algorithm depends on the number of clusters
specified.

For k=2, we get the following clusters:
Cluster 1: P1, P1, P2, P4
Cluster 2: P2, P2, P3, P5, P1